{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f8f8bef7-2b46-4a1c-b4cf-a76a08c361d8",
   "metadata": {},
   "source": [
    "# L3: Metadata Extraction and Chunking"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcb3ecde-b99a-485f-8a50-b44b479c9e7e",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> ⏳ <b>Note <code>(Kernel Starting)</code>:</b> This notebook takes about 30 seconds to be ready to use. You may start and watch the video while you wait.</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "523f46f8-a35a-4bc5-aa6c-7270a13e4f72",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Warning control\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96ba771b-4820-4207-9d7d-97aac18c8421",
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "logger = logging.getLogger()\n",
    "logger.setLevel(logging.CRITICAL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ba5e510-91cc-433a-b689-f37f0074de54",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from IPython.display import JSON\n",
    "\n",
    "from unstructured_client import UnstructuredClient\n",
    "from unstructured_client.models import shared\n",
    "from unstructured_client.models.errors import SDKError\n",
    "\n",
    "from unstructured.chunking.basic import chunk_elements\n",
    "from unstructured.chunking.title import chunk_by_title\n",
    "from unstructured.staging.base import dict_to_elements\n",
    "\n",
    "import chromadb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b746665c-479e-4e5b-be9c-59ac5f7aec8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from Utils import Utils\n",
    "utils = Utils()\n",
    "\n",
    "DLAI_API_KEY = utils.get_dlai_api_key()\n",
    "DLAI_API_URL = utils.get_dlai_url()\n",
    "\n",
    "s = UnstructuredClient(\n",
    "    api_key_auth=DLAI_API_KEY,\n",
    "    server_url=DLAI_API_URL,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73ea2b84-528f-4532-bfd3-ac657613b07a",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6ff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px\"> 💻 &nbsp; <b>Access Utils File and Helper Functions:</b> To access helper functions and other related files for this notebook, 1) click on the <em>\"View\"</em> option on the top menu of the notebook and then 2) click on <em>\"File Browser\"</em>. For more help, please see the <em>\"Appendix - Tips and Help\"</em> Lesson.</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b7456f6-7d15-404c-a164-409e77714fe6",
   "metadata": {},
   "source": [
    "## Example Document: Winter Sports in Switzerland EPUB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b68c4f65-4bb0-4e9b-9179-4c2bd7572b7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import Image\n",
    "Image(filename='images/winter-sports-cover.png', height=400, width=400)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "892f7b63-ebc4-47fe-a1c6-387edbd06f7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "Image(filename=\"images/winter-sports-toc.png\", height=400, width=400) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "417e7c7b-213c-4a2b-853b-14aa16d9fbe9",
   "metadata": {},
   "source": [
    "## View the content of the file\n",
    "- <a href=\"example_files/winter-sports.pdf\">Winter Sports (View PDF) -- Click Here</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "868a6bbf-a3e7-423b-8631-7ea2a5f399ed",
   "metadata": {},
   "source": [
    "## Run the document through the Unstructured API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc371344-251c-4e0b-bbce-c6e556251b02",
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = \"example_files/winter-sports.epub\"\n",
    "\n",
    "with open(filename, \"rb\") as f:\n",
    "    files=shared.Files(\n",
    "        content=f.read(),\n",
    "        file_name=filename,\n",
    "    )\n",
    "\n",
    "req = shared.PartitionParameters(files=files)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef0322d0-3821-42e7-b626-453e9c64a184",
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    resp = s.general.partition(req)\n",
    "except SDKError as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32159484-571c-49c4-b79d-5260ebfb48d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "JSON(json.dumps(resp.elements[0:3], indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3251d41-3ae0-4cf0-9aa5-2c67c2c91ac3",
   "metadata": {},
   "source": [
    "## Find elements associated with chapters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45c16155-cb42-4a76-9767-ffe30bf8ca45",
   "metadata": {},
   "outputs": [],
   "source": [
    "[x for x in resp.elements if x['type'] == 'Title' and 'hockey' in x['text'].lower()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef558156-a62f-4a57-b685-bb5ae27b0795",
   "metadata": {},
   "outputs": [],
   "source": [
    "chapters = [\n",
    "    \"THE SUN-SEEKER\",\n",
    "    \"RINKS AND SKATERS\",\n",
    "    \"TEES AND CRAMPITS\",\n",
    "    \"ICE-HOCKEY\",\n",
    "    \"SKI-ING\",\n",
    "    \"NOTES ON WINTER RESORTS\",\n",
    "    \"FOR PARENTS AND GUARDIANS\",\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6bd8ee0-ad8e-455b-94d0-030d67bb355f",
   "metadata": {},
   "outputs": [],
   "source": [
    "chapter_ids = {}\n",
    "for element in resp.elements:\n",
    "    for chapter in chapters:\n",
    "        if element[\"text\"] == chapter and element[\"type\"] == \"Title\":\n",
    "            chapter_ids[element[\"element_id\"]] = chapter\n",
    "            break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd219155-7197-4200-bc0f-020d05a5e816",
   "metadata": {},
   "outputs": [],
   "source": [
    "chapter_ids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49a25bf1-6de1-4159-a92c-237f80b02ed9",
   "metadata": {},
   "outputs": [],
   "source": [
    "chapter_to_id = {v: k for k, v in chapter_ids.items()}\n",
    "[x for x in resp.elements if x[\"metadata\"].get(\"parent_id\") == chapter_to_id[\"ICE-HOCKEY\"]][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f71c4a8a-5333-4eff-8097-e30bb6fb2337",
   "metadata": {},
   "source": [
    "## Load documents into a vector db"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e4eee60-d688-4235-ac63-02ab5efc7904",
   "metadata": {},
   "outputs": [],
   "source": [
    "client = chromadb.PersistentClient(path=\"chroma_tmp\", settings=chromadb.Settings(allow_reset=True))\n",
    "client.reset()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec45d0a2-ada6-4380-9eb6-b660aa9d275a",
   "metadata": {},
   "outputs": [],
   "source": [
    "collection = client.create_collection(\n",
    "    name=\"winter_sports\",\n",
    "    metadata={\"hnsw:space\": \"cosine\"}\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f2b4761-43b1-459c-bceb-8fa4d82a48bf",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> ⏳ <b>Note <code>(Wait Time)</code>:</b> The following block can take a few minutes to complete.</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca366ff8-bfb4-4edc-89c5-7147120fadb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "for element in resp.elements:\n",
    "    parent_id = element[\"metadata\"].get(\"parent_id\")\n",
    "    chapter = chapter_ids.get(parent_id, \"\")\n",
    "    collection.add(\n",
    "        documents=[element[\"text\"]],\n",
    "        ids=[element[\"element_id\"]],\n",
    "        metadatas=[{\"chapter\": chapter}]\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a599975-f80c-478c-845e-3aa06ad26147",
   "metadata": {},
   "source": [
    "## See the elements in Vector DB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77247bc5-4e14-44e5-95cc-dd85d7ec4b47",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = collection.peek()\n",
    "print(results[\"documents\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f5415c1-fa22-47f8-aad8-c5db2ebba0ac",
   "metadata": {},
   "source": [
    "## Perform a hybrid search with metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ea74acf-5504-43c1-8020-fee53a057a75",
   "metadata": {},
   "outputs": [],
   "source": [
    "result = collection.query(\n",
    "    query_texts=[\"How many players are on a team?\"],\n",
    "    n_results=2,\n",
    "    where={\"chapter\": \"ICE-HOCKEY\"},\n",
    ")\n",
    "print(json.dumps(result, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "335fd24d-e555-4050-9e83-31d5590bb6aa",
   "metadata": {},
   "source": [
    "## Chunking Content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1292bace-e5d9-49fb-9e6e-b1140cd025b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "elements = dict_to_elements(resp.elements)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb0b29a5-2d8c-4855-ac07-fe29ffb4dfe8",
   "metadata": {},
   "outputs": [],
   "source": [
    "chunks = chunk_by_title(\n",
    "    elements,\n",
    "    combine_text_under_n_chars=100,\n",
    "    max_characters=3000,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39a56fc9-8960-4659-a205-798dd74ff01c",
   "metadata": {},
   "outputs": [],
   "source": [
    "JSON(json.dumps(chunks[0].to_dict(), indent=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0513d3ce-31e1-43c4-a298-7fed41735bc7",
   "metadata": {},
   "outputs": [],
   "source": [
    "len(elements)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91a6a2f5-e260-4647-a8dc-67b258872b8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "len(chunks)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f4abb16-6fc3-4e82-ad24-3dcdbd45932d",
   "metadata": {},
   "source": [
    "## Work With Your Own Files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd655796-c755-4e50-9239-117529d5335a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import panel as pn\n",
    "#import param\n",
    "from Utils import upld_file\n",
    "pn.extension()\n",
    "\n",
    "upld_widget = upld_file()\n",
    "pn.Row(upld_widget.widget_file_upload)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "063848e5-f5a1-48cb-8d86-b0fce82a5c3d",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> 🖥 &nbsp; <b>Note:</b> If the file upload interface isn't functioning properly, the issue may be related to your browser version. In such a case, please ensure your browser is updated to the latest version, or try using a different browser.</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1eb317e4-d5e2-43e6-aa13-349012fac41a",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls ./example_files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a782cf5f-d615-484b-ac53-9a10fc758c90",
   "metadata": {},
   "source": [
    "<p style=\"background-color:#fff6ff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px\"> 💻 &nbsp; <b>Uploading Your Own File - Method 2:</b> To upload your own files, you can also 1) click on the <em>\"View\"</em> option on the top menu of the notebook and then 2) click on <em>\"File Browser\"</em>. Then 3) click on <em>\"Upload\"</em> button to upload your files. For more help, please see the <em>\"Appendix - Tips and Help\"</em> Lesson.</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ad4f9fa0-f788-4ae6-bac6-b9298efb205b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "301ff05c-4444-46d1-a386-900ec3aa83d6",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21ea6e7a-734f-4b6e-810d-e2967198bb4f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a013e6a-c444-4348-9929-a7971bd69d3c",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1093b1b1-5f6e-4ebd-8a14-476d022e826e",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11c9b1f1-c877-46ca-9838-edf7bce63684",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
